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THE  MOMENTS  OF  MATCHED  AND  MISMATCHED  HIDDEN  MARKOV  MODELS 

I.  INTRODUCTION 

Hidden  Markov  models  (HMMs)  are  statistical  models  of  nonstationary  time 
series  or  signals.  In  speech  applications,  they  are  used  to  characterize 
the  time  variation  of  the  short  term  spectra  of  spoken  words.  A  specific 
example  is  the  speaker-independent  isolated  word  recognition  (SIIWR) 
problem,  where  HMMs  characterize  the  words  in  a  finite-size  vocabulary. 
Different  words  are  characterized  by  different  HMMs.1 

Every  HMM  is  comprised  of  two  basic  parts:  a  Markov  chain  and  a  set  of 
random  variables.  The  Markov  chain  has  a  finite  number  of  states,  and  each 
state  is  uniquely  associated  with  one  of  the  random  variables.  The  state 
sequence  generated  by  the  chain  is  not  observable;  i.e.,  the  Markov  chain  is 
"hidden."  At  each  time  t  =  0,  1,  2,  ...,  the  Markov  chain  is  assumed  to  be 

in  some  state;  it  transitions  to  another  state  at  time  t  4-  1  according  to 

its  transition  probability  matrix.  At  each  time  t,  one  observation  is 
generated  by  the  random  variable  associated  with  the  state  of  the  Markov 
chain  at  time  t.  The  observations  are  referred  to  as  symbols.  If  the 
random  variables  assume  only  a  finite  set  of  values,  the  HMM  is  referred  to 

as  a  finite  symbol  HMM.  If  the  random  variables  assume  a  continuum  of 

values,  the  HMM  is  called  a  continuous  symbol  HMM.  The  full  parameter  set 
defining  an  HMM  is  comprised  of  the  initial  state  probability  density 
function  of  the  Markov  chain  at  time  t  =  0,  the  Markov  chain  state 

transition  probability  matrix,  and  the  probability  density  functions  of  each 
of  the  random  observation  variables. 

The  act  of  computing  specific  numerical  values  for  the  various 
parameters  of  an  HMM  is  called  "training,"  and  training  is  equivalent  to 
solving  a  mathematical  optimization  problem  to  determine  maximum  likelihood 
estimates  of  the  HMM  parameters.  In  this  paper,  it  is  assumed  that  the 
training  phase  is  completed  and  that  the  HMMs  developed  are  adequate  models 
for  each  of  the  nonstationary  time  series,  or  signals,  of  interest  (e.g., 
the  vocabulary  words  in  the  SIIVnR  problem). 
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During  the  training  phase,  a  suitable  preprocessor  is  developed  to  map 
(or  transform)  an  arbitrary  input  signal  s(t),  t  >  0,  into  a  discrete 

observation  sequence  (0(t),  t  =  1,  2,  A  good  description  of  one 

way  this  is  done  for  the  SIIWR  problem  is  given  in  reference  1 

(pp.  1C-77-1078).  The  preprocessor  is  also  utilized  in  the  classification 
phase  as  depicted  in  figure  1.  The  observation  sequence  is  truncated  to  the 
length  T  required  for  the  HMMs ,  where  T  is  a  positive  integer.  The 
truncated  sequence  Oy  =  (0(t),  t  =  1,  2,  ....  T)  is  then  passed  to  the 
HMM  recognizers.  Each  HMM  recognizer  evaluates  the  posterior  likelihood 
that  0y  comes  from  a  time  series  characterized  by  that  HMM.  Denote  the 

i-th  hidden  Markov  model  by  HMM(i).  The  i-th  recognizer  thus  computes  the 
posterior  likelihood  function 

f.(Oy)  =  Pr[0y  |  Oy  c  H MM ( i ) ]  ,  i  =  1 . p  ,  (1) 

where  Oy  e  HMM(i)  denotes  the  hypothesis  that  the  observation  sequence 
Oy  is  a  realization  of  HMM(i).  The  maximum  of  the  p  computed  posterior 
likelihoods  is  assumed  to  identify,  or  classify,  the  original  signal  s(t). 

In  practice,  some  kind  of  tie-breaking  rule  must  be  defined  and  some 

threshold  must  be  set  to  identify  signals  for  which  HMMs  have  not  been 

2 

trained.  The  likelihood  function  (1)  can  be  computed  with  only  n  T 

multiplications  (where  n  is  the  number  of  states  in  the  Markov  chain)  by 

2 

using  the  forward-backward  algorithm. 

The  misclassif ication  rate  (or  false  alarm  rate)  of  the  system  depicted 
in  figure  1  can  be  estimated  by  simulation  after  training  is  completed. 
Alternatively,  the  misclassif  ication  rate  of  signal  i  as  signal  j  can  be 
determined  from  the  conditional  cumulative  distribution  functions 

Fy j(x)  =  Pr[f.(0y)  <  x  |  Oy  c  HMM ( j ) ]  (2) 

by  using  classical  detection  and  estimation  methods  to  develop  receiver- 

3 

operator  characteristics  (ROC)  curves.  The  validity  of  misclassif ication 
rates  based  on  F..j(x)  depends  on  the  validity  of  the  assumption  that  the 
HMMs  developed  are  adequate  models  for  the  signals  of  interest.  Agreement 
between  theory  and  simulation  would  support  the  hypothesis  that  the  HMMs 
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really  do  represent  the  signals.  Unfortunately,  algorithms  for  calculating 

F-jj(x)  directly  from  the  HMM  parameters  are  not  known.  For  later 

reference,  note  that  F.  (x)  *  F..(x)  in  general. 

i  j  J 1 

The  moments  of  dF^(x)  are  defined  by  the  Riemann-Stieltjes  integral 


Mi j ( k ,T) 


/ 


dFij(x) 


k  =  0,  1 ,  2, 


(3) 


If  F..(x)  is  differentiable  with  derivative  F..(x),  then  the  moments 
i  J  J 

can  be  written  equivalently  as  the  Riemann  integral 


M. . ( k ,T) 
1J 


—00 


The  moments  depend  on  the  length  T  of  the  observation  sequence  because 
F.j(x)  depends  on  T,  as  seen  from  equation  (2).  They  uniquely  determine 
dF..(x)  when  they  are  all  finite  and  the  characteristic  function  of 

1 J  4 

dF...(x)  has  a  positive  radius  of  convergence.  From  equations  (1)  and 

(2),  it  is  clear  that  dF_(x)  =  0  for  x  <  0  and  for  x  >  1.  Thus,  from 
equation  (3), 


M.  .  (k,T)  = 
ijv 


1 


dF . j ( x)  <  1 


so  that  all  the  moments  are  finite.  The  series 


w 

V“> 

v=0 


for  the  characteristic  function  of  dF.j(u)  is  absolutely  convergent  with 
an  infinite  radius  of  convergence  because,  for  fixed  Uq  *  0,  each  summand 
is  bounded  above  in  magnitude  by  |uq(u/u!  and  thus  the  radius  of 
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convergence  must  be  at  least  as  large  as  e x p ( | <*>q  I )  -  Consequently,  the 
moments  of  dF_(x)  uniquely  determine  dF„(x)  and,  thereby,  F_(x). 

This  paper  presents  an  algorithm  for  computing  explicitly  the  moments 
of  dF^(x)  up  to  any  desired  order  directly  from  the  given  underlying 
parameters  of  the  HMMs  involved.  The  only  essential  assumption  made  is  the 
usual  one  that  the  HMMs  have  a  finite  number  of  states.  However,  it  is  not 
required  that  all  HMMs  have  the  same  number  of  states. 

This  paper  also  presents  examples  that  compare  the  first  two 
theoretical  moments  with  simulation  results.  The  examples  are  of 
independent  interest  because  they  exhibit  important  features  of  posterior 
likelihood  classification  based  on  ergodic  and  left-to-r ight  HMMs  that 
theoretical  analysis  alone  would  not  show  as  easily  or  as  quickly.  These 
features  are  important  because  they  indicate  how  the  internal  structure  of 
HMMs  impact  the  performance  c  ‘  the  system  depicted  in  figure  1. 
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II.  THE  MOMENT  ALGORITHM 

The  reader  is  assumed  to  be  familiar  with  such  first  principles  of  HMMs 
as  given  in  reference  2.  It  is  not,  however,  necessary  to  read  this  section 
to  undei  stand  the  examples  provided  in  section  III. 

A.  FINITE  SYMBOL  HMMs 

Let  HMM(v)  be  a  hidden  Marxov  process  with  n(v)  states,  u  =  1,  .... 
Subscripted  indices  will  always  be  written  as  functions  of  their  subscripts 
(for  instance,  n(u)  is  used  instead  of  n  )  to  avoid  the  later  use  of 

V 

subscripted  subscripts.  Let  the  state  transition  probability  matrix  of 

HMM(u)  be  denoted  as  Au  =  [a”  .  ..  for  i(u),  j(v)  -  I,  .... 

*  l  u 

n(v).  Let  the  initial  state  probability  vector  of  HMM(u)  be  denoted  as 

for  i(«)  *  1 . n(„). 

We  first  restrict  attention  to  finite  symbol  HMMs;  that  is,  we  suppose 
that  every  observation  sequence  Oy  =  (0(t),  t  -  1 . T}  is  such  that 

0(t)  c  V  =  {V1 . Vmj  , 

where  V  is  the  set  of  all  possible  output  symbols  of  the  preprocessor .  The 

true  nature  of  the  symbols  in  V  is  of  no  importance  here.  HMMs  assume  that 

0(t)  is  a  random  variable  whose  probability  density  function  depends  on  the 

current  state  of  the  Markov  process.  Let  the  discrete  probability  density 

function  for  HMM( i>)  when  it  is  in  state  i(u)  be  denoted  as  B?.  .,  for 

i(v)  =  1 . n(u).  Thus,  each  bV.  .  is  a  row  vector  of  length  m. 

U») 

Stacking  these  row  vectors  gives  the  n(v)-by-m  symbol  probability  matrix 
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Note  that 


bi(w)(Vj(v)}  "  bi(u),j(t>)  ’ 

where  we  define 

bi(v)(0(t})  =  PrC°('t)  I  HMM(u)  and  Markov  state  =  i(v)]  . 

The  assumption  that  the  training  phase  is  completed  means  that  the 
parameters  \u  =  (*”,  Av,  Bv)  are  known. 

For  discrete  symbol  HMMs ,  the  cumulative  function  F^(x)  is  an 

increasing  step  function  with  a  finite  number  of  jump  discontinuities.  Let 

X..  denote  the  set  of  all  values  of  x  for  which  F .  .  ( x )  is 

ij  .  .  .  1J 

discontinuous.  It  follows  that  dF„(x)  =  0  if  x  is  not  in  X„  and  that, 

for  x  in  X . . , 
iJ 

^(x)  =  F,j(xO  -  F,3(x-> 

•  Pr[f  (0T)  •  x  |  0T  t  HMM(  j ) ]  .  (4) 

Substituting  equation  (4)  into  equation  (3)  gives 

Mij(k,T)  =  ^  xk  Pr[f.(0T)  =  x  |  0T  c  HMM(j)] 

XcXij 

=  {f.(0T)}k  Pr[0T  |  0T  e  HMM(j)] 

°T 

-  ^{Pr[0T  1  0T  c  HMM(  i )  ]}k  Pr[Oy  |  0T  c  HMM(j)]  .  (5) 

°T 

It  is  clear  from  equation  (5)  that  M_(k,T)  *  M_(k,l)  in  general  for 

k  >  1.  For  k  --  1,  however,  we  have  M.^(l.T)  -  M..(l,l)  for  all  i,  j, 

and  I . 
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The  expression  in  equation  (5)  is  computable  directly  from  the 
parameters  of  HHM(i)  and  HMM(j);  however,  such  a  calculation  is  not 
practical  except  for  small  T  because  the  computational  effort  increases 

exponentially  in  T.  To  see  this,  note  that  the  forward-backward 

7  2 

algorithm  calculates  Pr[0T  |  0T  e  HMM(u)]  using  n  (u)  T  multipli- 

2  2 

cations.  Thus,  each  summand  in  equation  (5)  requires  k[n(i)  n(j)]  T 

multiplications.  There  are  mT  different  possible  observation  sequences 

0-j.  =  [0(t),  t  =  1 .  T}  because  each  0(t)  can  be  any  one  of  the  m 

output  symbols  in  V.  Thus,  direct  calculation  of  equation  (5)  requires  a 

2  2  T 

total  of  k[n(i)  n(j)]  T  m  multiplications. 

We  now  derive  a  recursion  for  equation  (5)  that  requires  computational 
effort  that  grows  only  linearly  with  T.  The  recursion  is  derived  for  a  more 
general  expression  that  contains  equation  (5)  as  a  special  case.  For  k  =  1, 
2,  ....  define 

k 

R(k,T)  =  £11  Pr[0T  |  0T  c  HMM(v)]  .  (6) 

0T  v=l 

The  application  of  equation  (6)  to  compute  any  moment  from  equation  (5)  is 
straightf orward ;  for  example,  R(k+1  ,T)  equals  M^Ck.T)  when  HMM(2)  =■  ...  = 
HMM( k+1 ) .  Note  that  R(k,T)  can  be  interpreted  as  a  joint  moment  of  HMMs . 


The  derivation  of  the  recursion  for  R(k,T)  proceeds  as  follows.  The 
forward  recursion  portion  of  the  forward-backward  algorithm  gives  the 
expression 

n(v) 

Pr[0T  |  0T  c  HMM{ v)  ]  =  ^  o*(j(®))  ,  (7) 

j(v)=l 


where,  for  2  <  t  <  T, 


and 


n(w) 

i(v)=l 


b"  ,(0(t))  , 
J(«) 


(8) 


^(j(V))  -  *J(W)  S(v)(0(1))  • 
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Substitute  equation  (7)  into  equation  (6)  to  obtain 

n(  i>)  k 

R(k.T)  =  ^  S  I!  ttT(j(u)) 

j(«)=l  oT  t>=l 
u=1 , . . .  ,k 

n(t>) 

=  ^  vT(j(i) . j(k))  ,  (10) 

j(w)=l 

v=l , . . . , k 

where  we  define  for  t  =  1,  ....  T 

k 

ut(j(D,  ....  j  (k) )  =  En  a”(j(v))  .  (11) 

0t  t>=l 


One  interpretation  of  uT  is  that  it  equals  R(k,T)  given  that  HMM(v)  must 
.nd  in  state  j(v),  u=l ,  ....  k.  We  seek  a  recursion  for  .  First 
suppose  that  2  <  t  <  T.  Then,  substituting  equation  (8)  into  equation  (11) 
gives 


ut(j(l) . j(k) )  = 


k  n(  v) 

En  E 

0  v=l  I i ( v ) =1 


n(  v) 

£  £ 

i(«)=l  0 

v=l . k 


1 — 

1 _ 

i — 

i _ 

— 1 

_ 1 

I  n  at-i  ^  ^ 

1  i  aHv),j(v) 

n 

j  -v=l 

Lv=i 

U=i  J 

n(  v) 


i(»)=l 

v=l,...,k 


I  1  3i  ( «) .  j  (  v) 


V=1 


n 


c^OU)) 


0t  Lv=i 


n 

v~  i 
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Because  <*^_-|(i(»))  does  not  depend  on  the  last  symbol  0(t)  in  the 
observation  sequence  0  =  [0(1),  ....  0(t)},  we  have 


PtU(l),  ....  3(k) ) 


Because  the  sum  over  0(t)  is  independent  of  the  observation  sequence  0^  1 
=  (0(1),  ....  O(t-l)},  as  well  as  the  indices  i(v),  and  because  of 
equation  (11),  we  have 


vt( j(l ) ,  • • • ,  j(k) ) 


n( «)  r  k 

■r^"> . j(k»  E  n 

i  (w)=l  Lv=l 
v=l , . . .  ,k 


where 


k 

r(j(i) . j  (k) )  =  En  bJ(„)(0(t)> 

0(t)  u=l 


m  k 

-  Eli  bi(„)<vs>  ■  <i3> 

S=1  v=l 

Equation  (12)  is  the  desired  recursion  for  2  <  t  <  T.  For  t  =  1, 
substituting  equation  (9)  into  equation  (11)  gives 

k 

w-|  ( j ( 1 ) . j(k) )  =  En  a”(j(v)) 

0(1)  v=1 
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-(n  -1,4  e  fr 

V=i  /  o(  i  >  o=i 

k 

=  r(j(i) . j(k))  n  wj(«)  ' 

0  =  1 

Let  k  =  1.  From  the  definition,  it  is  clear  that  R(1,T)  =  1  for  all  T, 
regardless  of  HMM(l),  because  the  sum  in  equation  (6)  is  over  all  0T .  To 
independently  check  the  recursion  (12)— (13) ,  note  that,  from  equation  (13), 

m 

r(j(l))  =  ^  bj(l)(vs}  =  1  •  1  i  1  i  T  • 

s=l 


From  equation  (14),  we  have 


wl(j(1))  *j(l)  ' 

Hence,  from  equation  (10),  we  obtain 

n(l ) 

R0.o  -  £  *](>)  ■  ’  ■ 

j(l)=l 

The  recursion  is  verified  for  T  =  1.  For  T  =  2,  from  equation  (12),  we  have 

n(l ) 

u2(j(l))  =  ai(l)J(l) 

i (1 )=1 

n(  1 ) 

=  S  *1(1)  3i ( 1 )  •  j( 1 ) 

i(l )=1 
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so  that,  from  equation  (10), 


RO 


n(l)  | 

n(l) 

E  ' 

E 

i 

"KD 

j(D=i  | 

i (1 )=1 

{ 

n(l )  I 

f 

n(l) 

£ 

j"id) 

E 

i(l)=l  | 

f 

jd)=i 

1  , 

Jd) 


,j(D 


and  the  recursion  is  verified  for  T  =  2. 


The  first  nontrivial  special  case  is  k  =  2.  In  this  case,  R(2,T)  is 
identically  the  first  moment  M  (1,T).  From  equation  (12),  we  have  for 
2  <  t  <  T 


PtU(l),j(2))  =  r(j(l),j(2)) 


n(l)  n( 2) 

£  £ 


i (1 )=1  i (2)=1 


vt.1(Kl),i(2)) 


a 


1 

id)  Jd) 


2 

ai(2),j(2) 


and,  from  equation  (14), 

U1(j(D,j(2))  =  T(  j(l )  ,  j(2) )  ir](1)  trj(2) 


where,  from  equation  (13), 

m 

r(jO) . j(2»  -E"J(i)<V  bj(2)<V  ' 

s=l 


From  equation  (10),  then,  we  have 

n(l)  n( 2) 

R(2,T)  =  ^  ^  uT( j (1 ) . j( 2) )  . 

j(D  =  l  3 ( 2 )  =  1 
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Computation  of  R(2,T)  =  M-^O.T)  is  therefore  not  excessively  laborious. 

The  evaluation  of  R(k,T)  using  the  recursion  (12)  is  properly  broken 

into  two  parts.  The  first  is  the  precalculation  of  r(j(l),  ...,  j(k))  for 

|( 

every  possible  value  of  the  indices  j(v).  This  requires  (k-l)m  N 

k 

multiplications  and  N  storage  locations,  where 


is  the  geometric  mean  of  the  number  of  different  states  in  the  various  HMMs 
and  is  not  necessarily  an  integer.  If  N  =  8  and  if  there  are  m  =  16 
different  observation  symbols,  then  computing  and  storing  r  for  k  =  6 
requires  262144  storage  locations  and  2.1  x  107  multiplications.  Storage 
is  clearly  more  crucial  an  issue  than  multiplications. 

It  is  possible  in  some  cases  to  utilize  the  underlying  symmetries  of  r 
to  reduce  both  storage  and  computational  effort.  For  example,  if  HMM(  2 )  = 
...  =  HMM(k-t-l),  then 

r(j(l )  ,j(2) ,  ....  j(k+l))  =  r(j(l),  o( j ( 2) ) . o(j(k+l)))  (16) 

for  every  permutation  a  of  the  k  integers  j(2),  ....  j(k+l).  The  proof  of 
equation  (16)  follows  easily  from  equation  (13)  because  multiplication  is 
commutative.  Thus  one  only  need  consider  indices  that  satisfy 

1  s  j(l)  <  n(l)  and  1  <  j(2)  <  j(3)  <  ...  <  j(k+l)  <  n( 2)  . 

The  number  of  ordered  index  sets  (j(v)}  is  equal  to  the  number  of 
combinations  of  n ( 2 )  letters  taken  k  at  a  time,  when  each  letter  may  be 
repeated  any  number  of  times  up  to  k.  Storage  is  therefore  proportional  to 
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k. 

which  is  significantly  smaller  than  the  [n(2)]  n(l)  storage  that  would 
otherwise  be  necessary.  The  total  multiplication  count  is  also  reduced 
proportionately. 

Once  T  has  been  computed  and  stored  for  a  given  value  of  k,  the 
recursion  (12)  can  be  computed  for  any  length  T  of  the  observation 

k 

sequence.  For  each  of  the  N  sets  of  indices  { j  ( u ) }  in  equation  (12), 

k 

the  sum  over  all  N  indices  (i(u)}  must  be  undertaken.  This  sum 

appears  to  require  k  N  multiplications;  however,  by  using  the  nested  form, 


n  ( 1 ) 

r  n  ( 2 ) 

"  n(k) 

E 

1 

ai(l),j(D 

E 

S  ai(k),j(k)  1Jt-l(i(1)*  • 

...  i ( k) ) 

i (1 )=1 

_i(2)=l 

_i  (k)=l 

- 

- 

it  is  possible  to  use  approximately 


N  +  N 


k-1 


+  N  +  N 


■(*) 


<Nk  - 


1) 


instead.  If  lower  order  terms  are  neglected,  computing  one  iteration  of 

2k 

equation  (12)  requires  about  N  multiplications.  For  an  observation 

2k 

sequence  of  length  T,  computing  yT  requires  on  the  order  of  N  T 

1  2 

multiplications.  If  N  =  8  and  T  =  32,  then  2.2  x  10  multiplications  are 
required  for  k  =  6.  Assuming  a  multiplication  takes  one  microsecond,  the 
calculation  requires  611  hours  and  is  clearly  impractical. 


Significant  reduction  in  computational  effort  is  possible  in  some  cases 
by  utilizing  the  underlying  symmetries  in  y^.  For  example,  if  HMM(2)  = 
. . .  =  HMM(  ki-1 ) ,  then 

vt(j(D,j(2) . j(k+l ))  =  vt(j(l),  o(j(2)) . o(j(k+1)))  (17) 

for  every  permutation  a  of  the  k  integers  j(2),  ...,  j(ki-l).  The  proof  of 
equation  (17)  follows  easily  by  induction  from  equation  (12)  because 
multiplication  is  commutative  and  because  r  satisfies  the  same  symmetry 
property  (16)  in  this  case.  Thus,  the  recursion  (12)  need  be  computed  for 
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only  N.  sets  of  indices.  The  total  multiplication  count  is  reduced  to 
o  K+’l  2k 

4Nr+-|T,  which  is  significantly  smaller  than  the  N  T  multiplications 

that  would  otherwise  be  needed.  For  the  above  example  requiring  611  hours, 

if  n  =  n(l)  =  n(2)  =  8  and  if  the  symmetry  (17)  is  utilized,  the  calculation 

would  be  reduced  to  roughly  a  96-minute  calculation.  Utilizing  symmetry  is 

clearly  significant  in  that  it  can  turn  an  impractical  long  calculation  into 

a  feasible  shorter  one. 


Underflow  is  potentially  a  problem  when  the  recursion  (12)  is  computed. 
It  can  be  easily  overcome  in  exactly  the  same  manner  as  pointed  out  in 
reference  2  for  preventing  numerical  underflow  during  the  calculation  of  the 
f orward-backward  algorithm.  Specifically,  let  be  computed  according  to 
equation  (12)  and  then  multiplied  by  a  scale  factor  c^  defined  by 


c 


t 


n(  v) 

E 


0(v)=1 
Lvsi , . . . , k 


vtUO), 


....  j (k) ) 


(18) 


Then  use  the  scaled  values  of  yt  in  the  recursion  (12)  to  compute  . 
which  is  in  turn  scaled  as  shown  in  equation  (18).  If  we  continue  in  this 
fashion  and  recall  the  expression  in  equation  (10),  it  follows  that 


R(k,T)  = 


(19) 


Because  the  product  cannot  be  evaluated  without  underflow,  we  compute  instead 


T 

log  R(k,T)  =  log  ct  . 

t=l 


(20) 


Any  convenient  scale  factor  can  be  used  instead  of  equation  (18).  A 

_  k  _ 

potentially  useful  one  might  be  to  take  ct  =  N  .  Using  ct  would 
eliminate  the  effort  of  computing  the  sum  in  equation  (18)  before  scaling. 
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B.  CONTINUOUS  SYMBOL  HMMS 

The  objective  of  this  section  is  to  show  that  the  moment  algorithm  for 
discrete  symbol  HMMs  can  be  carried  over  essentially  unchanged  to  continuous 
symbol  HMMs.  In  fact,  it  holds  also  for  continuous  vector  symbol  HMMs; 
however,  only  the  continuous  symbol  HMMs  are  treated  here  for  simplicity. 

Throughout  this  section,  it  is  assumed  that  each  output  symbol  0(t)  is  a 
real  random  variable  defined  on  some  underlying  event  space,  V.  The 
probability  density  function  of  0(t)  is  uniquely  defined  for  each  state  i(v) 

=  1 . n(u)  of  each  HMM(v),  v  =  1,  2,  ...,  and  is  denoted  as 

bV( w) ( x) -  Thus,  for  real  numbers  a  and  B  with  a  <  B,  we  have 


b”(u)(x)  dx  =  Pr[a  <  0(t)  <  B  |  HMM(v) 


and  Markov  state  =  i(o)]  . 


(21) 


An  observation  sequence  0T  =  (xt,  t  =  1,  2 . T}  is  a  sequence  of 

real  numbers  x^,  with  x^.  being  a  realization  of  the  random  variable 
0(t).  The  posterior  likelihood  function  f  (0^)  is  now  a  probability 
density  function  for  continuous  symbol  HMMs,  as  opposed  to  a  simple 
probability  (see  equation  (1))  for  discrete  symbol  HMMs.  Thus,-  for  real 
vectors  o'  and  B  with  a*  <  B,  we  have 


V°T>  d0T 


Pr[a  <  0T  <  B  |  0T  e  HMM(v) ]  , 


(22) 


where  0^.  e  HMM(v)  denotes  the  hypothesis  that  Oj  is  a  realization  of 
HMM(u)  and  d0^  =  dx1  ...  dx^.. 

The  conditional  cumulative  distribution  functions  F-jj(x)  are  defined 
by  equation  (2),  just  as  for  discrete  symbol  HMMs.  From  equation  (3),  we 
have  the  moments 
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Vk'T)  “  / 


x-  OF, j(x) 


■/. 


x  Pr[x  <  f . ( 0^ )  <  x  +  dx  |  0-j.  e  HMM(j)]  dx 


/®  p  OO 

•••  /  (fi(0T)}k  fj(0T)  dOT  , 


(23) 


— CO  —OO 


T-f old 


which  is  the  continuous  analog  of  equation  (5).  It  is  clear  from 
equation  (23)  that  M..(k,T)  =  M^.(k.T)  in  general  only  for  the  special 
case  k  =  1.  The  analog  of  equation  (6)  for  continuous  HMMs  is 


R(k.T) 


/®  *  oo  k 

•••  /  n  w  -t 

-00  •'-00  1 


T-fold 


(24) 


The  forward-backward  algorithm  for  computing  the  posterior  likelihood 

5 

function  for  continuous  HMMs  is  modified  as  follows: 

n(v) 

W  =  ^  ^(j(v))  ,  (25) 

j(v)=l 


where  a”(j(v))  is  computed  exactly  as  given  by  the  recursions  (8)  and 

(9),  with  the  only  difference  being  that  b”,  ,(0(t))  in  equation  (8)  is 

J  ( v) 

now  interpreted  as  the  probability  density  function  implicit  in 
equation  (21).  Consequently,  equation  (10)  still  holds  exactly  if  we  define 
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vt(3(l) . j(k))  = 


/“  r  “  k 

...  n  *?»<»»  d°t 

— <30  — OO 


t-fold 


as  the  analog  of  equation  (11).  Proceeding  as  before  with  t-fold  integrals 
replacing  t-fold  summations  gives  exactly  the  recursion  (12),  but  with  the 
one-dimensional  integral 


r(j(l). 


j(k))  = 


/®  k 

n 


( x)  dx 


in  place  of  equation  (13). 


The  remarks  in  the  preceding  section  concerning  storage,  multiplication 
counts,  and  symmetry  properties  all  apply  for  continuous  symbol  HMMs .  The 
primary  difference  is  that  equation  (27)  requires  an  integral  evaluation 
instead  of  a  finite  sum  as  in  equation  (13).  This  evaluation  increases  the 
initial  computational  overhead,  but  once  equation  (27)  is  computed,  the 
algorithm  (12)  proceeds  exactly  as  before. 
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III.  COMPARISON  OF  THEORETICAL  MOMENTS  WITH  SIMULATION 

Ergodic  Markov  chains  are  those  for  which  it  is  possible  to  transition 
from  every  state  to  every  other  state,  although  not  necessarily  in  one 
step.  Left-to-right  Markov  chains  are  those  for  which  transitions  to  lower 
numbered  states  are  not  allowed,  that  is,  have  probability  zero.  These  two 
types  of  chains  are  sufficiently  different  that  they  are  considered 
separately  in  the  examples. 

One  interpretation  is  that  ergodic  HMMs  are  models  of  quasi -stationary 
signals,  while  left-to-right  HMMs  are  models  of  transient  signals  that 
ultimately  become  stationary  (because  the  highest  numbered  state  is  not 
exited  once  it  is  entered).  One  might  therefore  expect  these  two  types  of 
HMMs  to  impact  the  performance  of  the  maximum  likelihood  classifier  depicted 
in  figure  1  in  different  ways.  The  three  examples  given  in  this  section 
support  this  expectation. 

Using  the  above  interpretation,  the  examples  may  be  described  as 
follows.  The  first  example  shows  that  maximum  likelihood  classification 
based  on  HMMs  can  reliably  distinguish  between  sufficiently  long  quasi¬ 
stationary  signals  with  a  reasonable  amount  of  computational  effort.  The 
second  example  shows  that  short  quasi -stat i onary  and  transient  signals  look 
significantly  different  to  the  HMM  transient  recognizer,  but  not  to  the  HMM 
recognizer  based  on  the  quas i -stationary  signal.  The  third  example  shows 
that  noisy  observations  of  transient  signals  adversely  affect  classification 
performance  by  making  the  transient  signal  appear  to  have  a  stationary 
component,  which  is  then  misclassif ied  by  the  HMM  transient  recognizer. 

A.  TWO  ERGODIC  HMMS 

HMM( 1 )  and  HMM(2)  are  five-state,  eight-symbol  ergodic  models  whose 
parameters  are  given  (rounded  to  three  significant  decimals)  in  tables  1 
and  2,  respectively.  HMM(l)  clearly  generates  observation  sequences  of 
uniformly  distributed  symbols.  HMM(2)  is  more  complex  in  structure,  but 
every  symbol  can  be  generated  in  every  state.  The  fundamenta1  question  ot 
interest  here  is  the  following.  How  long  must  an  observation  sequence  be  to 
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guarantee  that  maximum  posterior  probability  classification  (described  in 
the  Introduction)  will  be  98  percent  reliable?  We  will  give  what  may  best 
be  described  as  a  semiempi rical  answer  to  this  question. 

Because  of  the  nature  of  HMM(l),  it  is  easy  to  see  that 
f 1 ( 0y)  =  Pr[0T  I  0T  e  HMM(l)]  =  8~T  . 

In  other  words,  the  posterior  likelihood  function  based  on  HMM(l)  is 
constant  because  all  observation  sequences  are  equally  likely  if  0^.  e 
HMM(l).  In  particular,  f^O^)  cannot  distinguish  0^  c  HMM(l)  from 
0^  c  HMM( 2 )  and  thus  is  useless  for  c lass i f icat ion  . 

The  posterior  likelihood  function  based  on  HMM(2),  instead  of  HMM(l),  is 
useful  for  classification.  Ten-thousand  observation  sequences  0^  of  each 
HMM  were  generated,  and  the  posterior  probability  f^O^)  waS  cornPu'tecl 

Table  1.  Parameters  of  HMM(l) 


NUMBER  OF  MARKOV  STATES  =  5 
NUMBER  OF  SYM80LS  PER  STATE  =  8 
INITIAL  STATE  PROBABILITY  VECTOR: 


2 

o 

o 

m 

-01 

2 

LaJ 

O 

o 

-01 

2.00E 

-01 

2 

o 

o 

m 

-01 

2 

.  00E 

-01 

TRANSIT  ION 

PROBABILITY  MATRIX: 

2 

.  00E 

-01 

2 

.00E 

-01 

2.00E 

-01 

2 

.  00E 

-01 

2 

.  00E 

-01 

2 

.  00E 

-01 

2 

.00E 

-01 

2.00E 

-01 

2 

.  00E 

-01 

2 

.  00E 

-01 

2 

.00E 

-01 

2 

.  00E 

-01 

2.00E 

-01 

2 

.  00E 

-01 

2 

.  00E 

-01 

2. 

.  00E 

-01 

2 

.  00E 

-01 

2.00E 

-01 

2 

.  00E 

-01 

2 

.  00E 

-01 

2 

.  00E 

-01 

2 

.  00E 

-01 

2.00E 

-01 

2 

.  00E 

-01 

2 

.  00E- 

-01 

SYMBOL 

PR0BA8ILI 

TY  MATRIX 

(TRANSPOSED) : 

1 

.  25E 

-01 

1 

.  25E 

-01 

1 .25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1  .25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1  .25E 

-01 

1 

.2-E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1  .25E 

-01 

1 

.  2  5  E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1  .25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1  . 25E 

-01 

1 

.  25E 

-01 

1 

ro 

m 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1  . 25E 

-01 

1 

.  25E 

-01 

1 

.  2  5  E 

-01 

1 

.  25E 

-01 

1 

.  25E 

-01 

1  . 2  5E 

-01 

1 

.  25E 

-01 

1 

.  2  5  E 

-01 
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Table  2.  Parameters  of  HMM(2),  Rounded  to 
Three  Significant  Digits 


NUMBER  OF  MARKOV  STATES  =  5 
NUMBER  OF  SYMBOLS  PER  STATE  =  8 
INITIAL  STATE  PROBABILITY  VECTOR: 


1  .00E4-00 

O.OOEfOO 

0 .  OOE-t-OO 

0. OOE-t-OO 

0.00E+00 

TRANSITION 

1 .40E-01 

PROBABILITY  MATRIX: 
2.35E-01  3.08E-07 

1 .24E-01 

1 . 94E-01 

1 .40E-01 

1 . 1 4F  -0. 

2.99E-01 

2.13E-01 

2.34E-01 

4.37E-02 

3.20E-01 

1 . 72E-01 

1 . 27E-01 

3.38E-01 

9.73E-02 

4.97E-01 

1 . 53E-02 

1  .15E-01 

2.75E-01 

2.36E-01 

2.49E  -02 

4.27E-01 

2.82E-01 

2.98E-02 

SYMBOL  PROBABILITY  MATRIX  (TRANSPOSED): 


1 

.  81 E 

-01 

1 

.  22E 

-01 

7 

.  89E- 

-03 

1 

-p 

OD 

m 

-01 

7 

o 

-P 

m 

-02 

1 

.  39E 

-01 

8 

.  28E 

-02 

3 

.  23E 

-02 

9 

.  1  3E- 

-02 

1 

.  33E 

-01 

2 

.  67E 

-02 

1 

.60E 

-01 

5 

.  87E 

-02 

1 

.  08E 

-01 

2 

.  34E 

-01 

1 

.  79E 

-01 

1 

.  66E 

-01 

2 

.  1 8E 

-01 

1 

.  30E 

-01 

5 

.  97E 

-02 

1 

.  56E 

-01 

1 

.  58E 

-01 

2 

.  1  5E 

-01 

2 

.09E 

-01 

2 

.  35E 

-01 

1 

.  1 9  E  - 

-01 

5 

.  75E 

-02 

1 

.HE 

-01 

1 

.  02E- 

-01 

1 

.  03E 

-01 

1 

.  76E 

-01 

1 

.  32E 

-01 

2 

.40E 

-01 

6 

.  61 E 

-02 

1 

.  76E 

-02 

2 

.  3  7  E 

-02 

1 

.221 

-01 

1 

.17E 

-01 

1 

.  46E 

-01 

1 

.  47E 

-01 

using  the  forward 

-backward  algorithm.  Figure  2 

shows 

a  histogram 

of 

the 

natural  logarithm 

of 

dF22(x)  for  T  =  25. 

The 

observation  sequences 

are 

thus  matched  to 

the 

posterior  likelihood 

function. 

Figure  3 

shows 

a 

histogram  of  log  dF 

2 1  ( x )  for  T  =  25. 

In 

f igure 

3,  then. 

°T 

i  s 

mismatched  to  the  likelihood  function.  As  is  clear  from  figures  2  and  3, 
the  difference  between  the  mean  values  of  the  log  likelihood  functions  is 
about  1.4  standard  deviations.  Thus,  the  potential  exists  for  using 
log  dF ( x )  to  classify  observation  sequences;  however,  T  -  25  is  not  long 
enough  to  classify  with  high  reliability. 

■Vi*' 

A  useful  observation  drawn  from  figures  2  and  3  is  that  the  probability 

density  function  of  log  dF^j(x)  is  nicely  approximated  by  the  normal 

distribution.  Let  y.  .  and  a.,  denote  the  mean  and  standard  deviation  of 

iJ  ij 

log  dF_(x).  Then,  if  dF_(x)  is  log-normal,  it  is  easy  to  show  that 

v..  and  a.,  are  related  to  the  moments  M..(k,T)  by  the  formulas 
ij  ij  ij 
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v..  =  2  log  M.jO.T)  -  (1/2)  log  M.j(2,T)  (28) 

o-j  =  log  M.j(2,T)  -  2  log  M.^l.T)  .  (29) 

It  is  stressed  that  equations  (28)  and  (29)  hold  exactly  if  and  only  if 
dF^.(x)  is  truly  log-normal.  For  finite  symbol  HMMs ,  dF^(x)  is 

necessarily  discrete,  so  that  both  equations  (28)  and  (29)  must  be  viewed  as 
approximations.  Sufficient  conditions  under  which  it  may  be  proved  that 
dF„(x)  is,  in  some  sense,  approximately  log-normal  are  unknown. 

Table  3  gives  a  comparison  between  the  mean  and  standard  deviations  of 
log  dF^.(x)  estimated  from  10000  observation  sequences  0^  and  those 
calculated  from  equations  (28)  and  (29).  This  table  shows  good  agreement 

between  the  approximations  of  equations  (28)  and  (29)  and  the  sample  means 
and  variances.  It  also  establishes  that  observation  sequences  of  length 
T  =  400  are  long  enough  to  distinguish  between  0^  c  HMM(l)  and  0(  e 

HMM( 2)  with  high  reliability.  That  is,  the  difference  between  the  mean 

value  of  log  dF.^(x)  and  the  mean  value  of  log  dF^x)  is  about  5.2 
standard  deviations.  Assuming  log  dF21(x)  and  log  dF^tx)  are  normally 
distributed,  as  they  appear  to  be,  then  the  reliability  of  the  maximum 
posterior  classifier  is  about  98  percent. 

Computing  the  posterior  likelihood  function  f  (0T)  for  T  -  400 

requires  n  T  =  10000  multiplications.  This  is  about  the  same  level  of 

effort  as  computing  one  512-point  FFT,  which  requires  (2) (51 2) ( log^  512)  - 
9216  multiplications.  In  other  words,  the  computational  requirements  of 
f  2 ( °400 )  are  sma^  enough  for  practical  application.  Furthermore,  the 
f  orward-backward  algorithm  for  computing  f^O^)  mathematically 

equivalent  to  a  nested  sequence  of  matrix-vector  multiplications. 
Consequently,  it  is  possible  to  reduce  total  computation  time  by  the  design 
of  a  "black  box"  to  exploit  this  special  structure  in  hardware. 

B.  MIXE0  ERG0DIC  AN0  I.EF  T-T0-RIGHT  HMMS 

HMM(3)  is  a  five-state,  eight -symbol  lef t -to -right  model  whose 

parameters  are  given  in  table  4.  It  has  a  structure  that  might  conceivably 
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Table  3.  Comparison  of  Two  Estimates  for  the 
Mean  and  Standard  Deviation  of  log  dF2j(x)  for  j  =  1,  2 


Mean  Value 

Standard 

Deviation 

T 

Sample  Eq.  28 

Sample 

Eq.  29 

5 

-10.8 

-10.6 

0.95 

0.71 

10 

-21  .3 

-21  .2 

1.11 

0.92 

15 

-31.9 

-31  .8 

1  .24 

1  .10 

20 

-42.4 

-42.4 

1  .35 

1.25 

25 

-53.0 

-52.9 

1  .47 

1  .38 

50 

-105.8 

-105.8 

1  .93 

1  .91 

100 

-211  .4 

-211.5 

2.62 

2.67 

200 

-422.6 

-423.0 

3.60 

3.76 

400 

-845.0 

-845.8 

5.09 

5.30 

5 

-10.1 

-10.1 

0.69 

0.59 

10 

-20.3 

-20.3 

0.90 

0.84 

15 

-30.6 

-30.5 

1  .08 

1  .03 

20 

-40.8 

-40.8 

1  .23 

1  .20 

25 

-51  .0 

-51  .0 

1  .37 

1  .34 

50 

-102.1 

-102.1 

1  .92 

1  .90 

100 

-204.4 

-204.4 

2.66 

2.69 

200 

-408.9 

-409.0 

3.77 

3.80 

400 

-818.0 

-818.1 

5.33 

5.37 

arise  in  the  SIIWR  problem.  Note  that  HMM(3)  never  leaves  the  fifth  state 
once  it  is  entered.  Consequently,  all  sufficiently  long  observation 
sequences  ultimately  contain  only  the  three  symbols  Vc  ,  V.,,  and  V0. 

0  i  o 

Note  also  that  the  symbol  V  occurs  if  and  only  if  the  fifth  state  has 

0 

been  entered.  It  follows  that  an  observation  sequence  0^  containing  the 
symbol  V  and  subsequently  containing  any  of  the  five  symbols  V  ,  V  , 

O  l  C 

V^,  ,  or  must  have  posterior  likelihood  zero;  i.e., 

f^O-j.)  =  0.  Other  forbidden  symbol  sequences  may  also  be  noticed.  It 
will  be  seen  that  these  facts  make  a  powerful  discriminator 

against  ergodic  observation  sequences.  To  summarize  briefly,  this  example 

will  show  that  short  observation  sequences  of  quasi -stationary  and  transient 
HMMs  look  very  different  to  the  transient  HMM  recognizer.  On  the  other 

hand,  all  observation  sequences  look  somewhat  alike  to  ergodic  HMM 
recogni zers . 

When  HMM( 3)  enters  its  fifth  state,  it  becomes  stationary  and. 
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Table  4.  Parameters  of  HMM(3) 


NUMBER  OF  MARKOV  STATES  =  5 
NUMBER  OF  SYMBOLS  PER  STATE  =  8 
INITIAL  STATE  PROBABILITY  VECTOR: 


1 .00E+00 

0.00E+00 

O.OOE+OO 

O.OOE+OO 

O.OOE+OO 

TRANSITION 

6.00E-01 

PROBABILITY  MATRIX: 
4. 00E-01  0.00E+00 

O.OOE+OO 

O.OOE+OO 

0.00E+00 

7. 00E-01 

2. 00E-01 

1  .OOE-Ol 

O.OOE+OO 

0.00E+00 

0.00E+00 

6. 00E-01 

4. OOE-Ol 

O.OOE+OO 

O.OOE-t-OO 

0.00E+00 

O.OOE+OO 

7. OOE-Ol 

3. OOE-Ol 

0.00E+00 

0.00E+00 

O.OOE+OO 

O.OOE+OO 

1 . 00E+00 

SYMBOL  PROBABILITY  MATRIX  (TRANSPOSED): 
9.00E-01  1.00E-01  O.OOE-t-OO  O.OOE+OO 

O.OOE+OO 

1 .OOE-Ol 

6. 00E-01 

O.OOE+OO 

O.OOE+OO 

O.OOE+OO 

O.OOE-t-OO 

2. 00E-01 

3. OOE-Ol 

O.OOE+OO 

O.OOE+OO 

O.OOE-t-OO 

1 .00E-01 

6. 00E-01 

1  .OOE-Ol 

O.OOE+OO 

0.00E+00 

O.OOE+OO 

1 .OOE-Ol 

2. OOE-Ol 

O.OOE+OO 

0.00E+00 

0.00E+00 

O.OOE+OO 

4. OOE-Ol 

1 .OOE-Ol 

0.00E+00 

O.OOE+OO 

O.OOE+OO 

3. OOE-Ol 

6. OOE-Ol 

0.00E+00 

O.OOE+OO 

O.OOE+OO 

O.OOE+OO 

3. OOE-Ol 

consequently,  significantly  less  interesting.  Insight  into  the  length  of 
the  transient  portion  of  HMM(3)  observation  sequences  is  gained  by 
estimating  the  first  passage  time  of  HMM(3)  into  its  fifth  state,  that  is, 
the  number  of  transitions  in  the  Markov  process  before  its  fifth  state  is 
entered.  The  mean  and  variance  of  first  passage  times  may  be  computed 
explicitly;6  however,  simulation  was  used  here  instead.  In  10000 
observation  sequences  generated  for  HMM(3),  it  was  found  that  the  mean  and 
standard  deviation  of  the  first  passage  time  was  10.9  and  4.8, 
respectively.  The  least  first  passage  time  was  3  transitions,  and  the 
largest  first  passage  time  was  43  transitions.  Thus,  observation  sequences 
almost  certainly  become  stationary  for  t  >  50. 

Figure  4  and  table  5  clearly  show  that  dF 33( x )  is  3  "well-behaved" 
distribution  even  though  HMM( 3)  is  not  ergodic.  However,  dF33(x)  is  not 
as  closely  approximated  by  a  log-normal  distribution  as  are  d F 2 1 ( x )  and 
d F 2 2 ( X )  ’  35  evidenced  the  discrepancy  in  table  5  between  the  sample 
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Figure  4.  Histogram  of  10000  Values  of  log  dF33(x)  for  T  =  25. 
(The  Normal  Curve  Has  the  Sample  Mean  and  Variance  Given  in  Table  5.) 


statistics  and  the  statistics  that  would  hold  if  dF33(x)  were  truly 


log-normal 


Ten-thousand  observation  sequences  of  HMM(l)  and  HMM(2)  were  generated 


and  the  posterior  likelihood  was  computed  using  the  forward- 


backward  algorithm.  The  observation  sequences  are  thus  mismatched  to  the 
posterior  likelihood  function.  Table  6  gives  the  number  of  sequences  for 


which  f3(°y)  =  Better  than  99-percent  rejection  of  the  simulated 


ergodic  HMM  observations  was  attained  when  T  -  10,  that  is,  when  the 
observation  sequences  were  about  as  long  as  the  mean  first  passage  time  of 
HMM(3)  into  state  5.  Total  rejection  of  the  10000  ergodic  observations 
occurred  for  T  =  20. 


The 
much  more 


ability  of  ^3(°y)  to  reject  observations 


of 


0r  c  HMM( 2) 


i  s 


impressive 


lack  of  symmetry  F.^(x) 


than 

* 


the 


f^O-j.)  rejection  of  0^  e  HMM(3).  The 


Fj.(x)  is  striking  in  this  instance. 


Table  7 


gives  two  estimates  of  the  mean  and  standard  deviation  of  log  dF^3(x),  and 


figure  5  is  a  histogram  of  the  case  T  -  25.  The  mean  values  of  the  10000 
samples  and  those  predicted  by  equation  (28)  agree  very  well;  however. 
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Table  5.  Comparison  of  Two  Estimates  for  the  Mean  and  Standard 

Deviation  of  log  dF33(x) 


Mean 

Value 

Standard 

Deviation 

T 

Sample 

Eq.  28 

Sample 

Eq.  29 

5 

-  5.6 

-  4.9 

1  .92 

1  .13 

10 

-12.8 

-11.8 

2.30 

1  .91 

15 

-18.6 

-18.2 

2.72 

2.33 

20 

-23.5 

-22.6 

3.22 

2.29 

25 

-28.1 

-26.3 

3.61 

2.15 

50 

-50.5 

-47.2 

4.59 

2.75 

Table  6. 

Number  of  Oj  e  HMM(i) 
f3(0T)  =  0.  i  =  1  ,  2 

for  Which 

T 

HMM(l) 

HMM( 2 ) 

5 

9389 

9172 

10 

9937 

9918 

15 

9997 

9988 

20 

10000 

10000 

Table  7.  Comparison  of  Two  Estimates  for  the 
and  Standard  Deviation  of  log  dF£3(x) 

Mean 

Mean 

Value 

Standard  Deviation 

T 

Sample 

Eq.  28 

Sample 

Eq.  29 

5 

-  10.8 

-10.8 

0.51 

0.60 

10 

-  21  .4 

-21  .4 

0.91 

0.89 

15 

-  32.0 

-32.0 

1  .02 

1  .04 

20 

-  42.6 

-42.7 

1  .04 

1.15 

25 

-  53.2 

-53.5 

1  .05 

1.12 

50 

-106.4 

-106.8 

1.11 

1  .43 

dF  (x)  is  not  as  well  approximated  by  a  log-normal  as  d  F  ( x )  and 

co  Cu 

dF^(x),  as  seen  from  the  discrepancy  in  the  sample  versus  the  predicted 
standard  deviations.  In  any  event,  it  is  clear  by  comparing  table  7  with 
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Figure  5.  Histogram  of  10000  Values  of  log  dF23(x)  for  T  =  25. 

(The  Normal  Curve  Has  the  Sample  Mean  and  Variance  Given  in  Table  7.) 

the  lower  half  of  table  3  that  f2(0T)  cannot  reliably  distinguish  0T  e 
HMM(3)  from  0T  e  HMM(2)  when  T  =  50.  However,  since  the  first  passage 
time  of  HMM( 3)  is  almost  certainly  less  than  T  =  50,  increasing  the 
observation  sequence  length  to  improve  reliability  is  not  appropriate  if  the 
underlying  intent  is  the  classification  of  the  transient  portion  of  HMM(3). 

C.  LEFT-TO-RIGHT  HMM  WITH  NOISE 

In  this  example,  the  effect  of  noise  on  the  maximum  posterior  likelihood 
classifier  is  assessed  for  the  left-to-right  model  HMM( 3) .  The  right  way  to 
study  noise  in  finite  symbol  HMMs  is  to  add  the  noise  to  the  original  time 
series  s(t)  and  then  analyze  the  particular  preprocessor  under  consideration 
to  determine  the  noisy  symbol  sequence.  However,  no  particular  preprocessor 
is  proposed  here,  and  so  we  resort  to  modeling  noise  in  much  the  same  way 
that  Shannon  modeled  noisy  discrete  memoryless  channels.7  This  approach 
can  give  an  indication  of  the  successful  classification  rate  as  a  function 
of  the  probable  number  of  incorrect  symbols  in  an  observation  sequence,  but 
it  cannot  provide  an  assessment  of  the  effect  of  signal-to-noise  ratio  on 
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classification  because  such  an  assessment  requires  knowledge  of  the 
preprocessor. 

Denote  by  h^  the  probability  that  the  observation  symbol  is 

altered  to  symbol  V.  by  the  noise  mechanism  and  define  the  m-by-m  noise 

J 

probability  matrix  H  =  [h,  .].  It  is  assumed  that  H  is  independent  of  the 

K  J 

state  of  the  Markov  chain  and  of  time  t.  Consequently,  the  output  of  a 
given  HMM  corrupted  by  noise  is  equivalent  to  another  HMM  that  is 
noiseless.  If  x  =  (ir.  A,  B)  are  the  parameters  of  a  given  HMM  with  noise 
matrix  H,  the  parameters  of  the  equivalent  noiseless  HMM  are  X  =  (it,  A, 
BH).  The  proof  is  straightforward:  the  product  b-k  h  is  the 
probability  that  the  state  of  the  Markov  chain  is  i  and  that  symbol  j  is 
produced,  given  that  symbol  k  was  the  output  of  the  given  HMM.  The  sum  over 
k  of  b^k  gives  the  component  b.^  of  the  equivalent  noiseless  HMM 
symbol  probability  matrix  B.  Clearly,  tX  equals  the  (i,j)  component  of 
the  product  BH,  so  that  B  =  BH. 

The  noise  probability  matrix  H  must  be  row  stochastic;  that  is,  every 
row  sum  must  equal  one.  The  HMM-generated  symbol  is  altered  by  noise 
to  one  of  the  available  symbols,  so  that  row  k  must  sum  to  one. 

Because  H  has  row  sums  equal  to  one,  the  matrix  B  is  a  valid  symbol 
probability  matrix  for  the  equivalent  noiseless  HMM;  that  is,  each  row  of 
B  =  BH  sums  to  one.  We  have 


m  m 


j=l  j=l  k=l 


k  hkj 


111  Ml 

ixz 

k=l  j=l 
m 

IX 


=  i  . 
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The  worst  case  noise  probability  matrix,  denoted  H  ,  has  the  constant 
entry  h?^  =  1/m  for  all  i  and  j.  In  this  case, 


III  III 

**ij  ~  ^  >  bik  hkj  '  m  ^  >  bik  "  m  * 


Consequently,  HMMs  with  noise  probability  matrix  H  are  indistin¬ 
guishable.  In  fact,  makes  all  HMMs  statistically  equivalent  to  the 
ergodic  HMM(l)  given  in  table  1. 

Let  Pr[V..]  be  the  relative  frequency  of  occurrence  of  the  symbol 

ir.  observation  sequences  of  length  T  before  the  addition  of  noise.  Thus,  we 

have  Z  Pr[V..]  =  1.  After  alteration  by  noise,  the  probability  of 

correct  occurrences  of  V.  in  0,  is  then  Pr[V.l  h...  The  probability 

i  T  i  n 

that  the  symbol  0(t)  e  0  is  correct  is 


>t  hii 


and  the  probability  that  0(t)  is  incorrect  is 


ET  =  1  -  °T  • 


For  the  examples  here,  given  a  specific  value  of  ET,  we  choose  the  simple 
noise  probability  matrix  H  defined  by 


alii  , 

all  i  *  j  . 


(32) 


For  this  choice  of  H,  0^  is  independent  of  the  actual  values  of  Pr[V,], 
as  is  clear  from  equation  (30)  and  the  fact  that  Z  Pr[V  ]  =  1. 
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Noise  tends  to  make  observations  of  all  HMMs  look  like  observations  of 
HMM(l),  and  ergodic  observation  sequences  tend  to  have  forbidden  symbol 
sequences  for  the  left-to-right  HMM(3) .  The  first  natural  issue  is 
therefore  to  determine  how  many  forbidden  symbol  sequences  occur  as  a 
function  of  the  incorrect  symbol  probability  Table  8  gives  the 

results  for  various  values  of  T  and  E^,  based  on  simulations  of  10000 
observation  sequences.  It  shows  that  forbidden  symbol  sequences  are  less 
likely  for  small  T  than  for  large  T.  This  table  also  shows  that  noisy 
observations  of  HMM(3)  do  not  have  as  high  a  proportion  of  forbidden  symbol 
sequences  as  observations  of  HMM(l)  and  HMM(2) ,  even  for  E^  =  10  percent, 
as  can  be  seen  by  comparing  tables  6  and  8.  One  may  conclude  from  table  8 
that  E-j-  must  be  small  and  T  must  be  short  to  minimize  misclassif ication 
due  to  forbidden  symbol  sequences.  For  instance,  if  T  =  25  and 
Et  =  0.001,  the  misclassif ication  rate  is  apparently  at  least 
1.21  percent.  Shorter  T,  however,  causes  smaller  shifts  in  the  statistics 
in  the  likelihood  function  and  thus  increases  the  misclassif ication  rate. 
Consequently,  a  tradeoff  exists  between  short  T  and  long  T. 

The  total  misclassif  ication  rate  can  be  expressed  as  the  sum  of  the 
misclassification  rate  due  to  forbidden  symbol  sequences  and  the 
misclassification  rate  due  to  noise-induced  shift  in  the  statistics  of  the 
nonzero  values  of  the  posterior  likelihood  function.  We  examine  the  total 
misclassification  rate  for  HMM(4),  which  is  defined  to  be  the  HMM  equivalent 


Table  8.  Number  of  0-j-  e  HMM(3)  +■  Noise 
for  Which  f 3 ( 0 j )  =  0  at  Various  Values  of  Ej 


Et 

T 

0.1 

0.01 

0.001 

0.0001 

5 

2194 

236 

23 

1 

10 

3906 

443 

37 

1 

15 

5305 

651 

64 

11 

20 

6625 

986 

103 

13 

25 

7643 

1303 

121 

11 

50 

9643 

2684 

345 

34 
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to  HMM( 3)  with  the  noise  matrix  H  given  by  equation  (32)  with  E^.  =  0.001. 
The  parameters  of  HMM(4)  are  given  explicitly  in  table  9. 


Denote  by 


F  u<» 


the  cumulative  distribution  function 


Pr[0  <  f.(0T)  <  x  |  0T  c  HMM(j)]  , 

0, 


for  x  >  0  , 


for  x  <  0  . 


(33) 


Ten-thousand  observation  sequences  0^  were  generated  from  HMM(4)  for  1  = 
25.  As  given  in  table  8,  121  sequences  resulted  in  zero  posterior 
likelihood  function  values  (that  is,  f3(0y)  =  0)  and  the  remaining  98/9 
nonzero  values  of  f^O^)  ?Pve  histogram  shown  in  figure  6.  By 

comparison  with  figure  4,  it  is  clear  that  no  significant  difference  between 


log  dF34(x) 


and 


log  dF33(x) 


is  evident.  Therefore,  the  misclassifi 


Table  9.  Parameters  of  HMM(4),  Rounded 
to  Three  Significant  Digits 


NUMBER  OF  MARKOV  STATES  *  5 
NUMBER  OF  SYMBOLS  PER  STATE  =  8 
INITIAL  STATE  PROBABILITY  VECTOR: 


1 . QOE+OO 

0. 00E+00 

0.00E+00 

0.00E+00 

0.00E+00 

TRANSITION 

PROBABILITY  MATRIX: 

6.00E-01 

4.00E-01 

0.00E+00 

0.00E+00 

0.00E+00 

0.00E+00 

7.00E-01 

2.00E-01 

1 .00E-01 

0.00E+00 

0 .  QOE+OO 

0. 00E+00 

6.00E-01 

4.00E  -01 

0.00E+00 

0 .OOE+OO 

0 . 00E+00 

0.00E+00 

7.00E-01 

3.00E-01 

0. 00E+00 

0. OOE-t-OO 

0.00E+00 

0.00E+00 

1 . OOE+OO 

SYMBOL  PROBABILITY  MATRIX  (TRANSPOSED): 


8 

.99E 

-01 

1 

.  00E 

-01 

1 

.  43E 

-04 

1 

.  43E 

-04 

1 

.  43E 

-04 

1 

.  00E 

-01 

5 

.  99E 

-01 

1 

.  43E 

-04 

1 

.  43E 

-04 

1 

.  43E 

-04 

1 

.  43E 

-04 

2 

.  00E 

-01 

3 

.  00E 

-01 

1 

.  43E 

-04 

1 

.  43E 

-04 

1 

.  43E 

-04 

1 

.  00E 

-01 

5 

.  99E 

-01 

1 

.  00E 

-01 

1 

.  43E 

-04 

1 

.  43E 

-04 

1 

.  4  3  E 

-04 

1 

.  00E 

-01 

2 

.  00E 

-01 

1 

.  43E 

-04 

1 

.  43E 

-04 

1 

.  43E 

-04 

1 

.  43E 

-04 

4 

.  00E 

-01 

1 

.  00E 

-01 

1 

.43E- 

-04 

1 

.  43E 

-04 

1 

.  4  3  E 

-04 

3 

.  00E 

-01 

5 

.  99E 

-01 

1 

.43E 

-04 

1 

.  4  3E 

-04 

1 

.43E 

-04 

1 

.  43E 

-04 

3 

.  00E 

-01 
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(The  Normal  Curve  Has  the  Sample  Mean  -  -28.156  and 
the  Variance  =  3.6167.) 

cation  rate  due  to  noise-induced  shifts  in  the  statistics  of  dF^4(x)  is 
effectively  zero.  The  maximum  posterior  classifier  is  thus  98.8  percent 
reliable  when  used  with  noisy  observations  characterized  by  the  noise 
probability  matrix  H. 


Because  =  0.001  in  this  example,  each  observation  sequence  0^^ 
has  probability  0.025  of  having  at  least  one  incorrect  symbol.  Of  10000 
observation  sequences,  the  expected  number  with  at  least  one  incorrect 
symbol  is  250.  Nearly  half  (121)  contained  forbidden  symbol  sequences  and 
caused  the  only  significant  mi  sc lassi f icat i on  problem.  The  other  half 
apparently  made  no  contribution  to  mi  sc  lass i f icat i on . 


It  would  be  desirable  to  be  able  to  compute  the  moments  of  F.j(x) 
instead  of  F„(x).  Alternatively,  it  would  be  desirable  to  be  able  to 
compute  the  amplitude  of  the  impulse  (delta  function)  in  dF.  (x)  that 
seems  to  be  present  in  the  lef t-to-right  HMMs  considered  here.  In  other 


words,  if  we  write 


dF  .  j ( x )  =  A  6(x)  ♦  dF".(x) 


(34) 
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then  an  algorithm  to  compute  A  directly  would  be  worthwhile.  Knowing  A  and 
the  moments  of  F.j  gives  the  moments  of  F^(x).  However,  developing 
such  an  algorithm  requires  further  work. 

IV.  CONCLUSIONS 

The  first  two  moments  M.j(l,T)  and  M.j(2,T)  of  F.j(x)  give  good 
estimate,  of  the  probability  density  function  dF^.(x)  in  the  case  when 
HMM( i )  and  HMM(j)  are  both  ergodic.  The  reason  is  that  dF..(x)  is 

apparently  nearly  log-normal.  The  evidence  supporting  this  claim  is 
strictly  empirical  and  a  proof  of  the  degree  of  approximation  of  dF..(x) 
to  log-normal  would  be  worthwhile.  When  either  HMM(i)  or  HMM(j),  or  both, 
are  not  ergodic,  dF.j(x)  does  not  approximate  the  log-normal 

distribution.  Consequently,  higher  order  moments  are  needed  to  develop 
reasonable  continuous  approximations  to  dF^(x). 
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